embedding and clustering
Embedding And Clustering Your Data Can Improve Contrastive Pretraining
Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
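The clustering step described in this abstract — embedding the training data with a pretrained model, then running k-means within each source so that every minibatch draws from a single semantic cluster — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random stand-in embeddings, cluster count, and batch size are all assumptions.

```python
# Sketch: stratify contrastive-pretraining minibatches by semantic cluster.
# The embeddings here are random stand-ins for pretrained text embeddings;
# k and batch_size are illustrative choices, not values from the paper.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))  # stand-in for one source's embeddings

k = 8  # assumed number of semantic clusters within this source
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# Group example indices by cluster so each minibatch is single-cluster.
clusters = {c: np.flatnonzero(kmeans.labels_ == c) for c in range(k)}

def single_cluster_batches(clusters, batch_size):
    """Yield index arrays, each drawn entirely from one cluster."""
    for idx in clusters.values():
        for start in range(0, len(idx), batch_size):
            yield idx[start:start + batch_size]

batches = list(single_cluster_batches(clusters, batch_size=32))
```

Because in-batch negatives now come from the same semantic cluster, they are harder on average — the same intuition behind TAS-B's topic-aware sampling and ANCE's nearest-neighbor negative mining.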
- North America > United States > Montana > Flathead County > Kalispell (0.14)
- North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
- North America > Canada (0.04)
- (18 more...)
- Leisure & Entertainment (1.00)
- Law (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- (6 more...)
Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, gray scale n x n images of a fixed object taken with a moving camera yield data points in ℝ^{n²}. However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in ℝ^{n²}.
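The locality-preserving embedding this abstract describes is implemented in scikit-learn as `SpectralEmbedding`, which builds a neighborhood graph and embeds the data via the bottom eigenvectors of the graph Laplacian. A minimal sketch, assuming the swiss-roll toy dataset and the neighbor count as illustrative choices:

```python
# Sketch of Laplacian-eigenmaps-style embedding via scikit-learn's
# SpectralEmbedding. Dataset and parameters are illustrative assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

# 3-D data that actually lies on a 2-D manifold (a rolled-up sheet).
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Build a nearest-neighbor graph, then embed with the graph Laplacian's
# bottom eigenvectors, preserving local neighborhood structure.
embedding = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
).fit_transform(X)
```

The resulting `embedding` has shape `(500, 2)`: each point's coordinates are determined by its graph neighbors, which is the locality-preserving property the abstract highlights.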
Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
Belkin, Mikhail, Niyogi, Partha
Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering.
- North America > United States > Illinois > Cook County > Chicago (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)